Pos-tagging different varieties of Occitan with single-dialect resources
نویسندگان
چکیده
In this study, we tackle the question of pos-tagging written Occitan, a lesser-resourced language with multiple dialects each containing several varieties. For pos-tagging, we use a supervised machine learning approach, requiring annotated training and evaluation corpora and optionally a lexicon, all of which were prepared as part of the study. Although we evaluate two dialects of Occitan, Lengadocian and Gascon, the training material and lexicon concern only Lengadocian. We concluded that reasonable results (> 89% accuracy) are possible with a very limited training corpus (2500 tokens), as long as it is compensated by intensive use of the lexicon. Results are much lower across dialects, and pointers are provided for improvement. Finally, we compare the relative contribution of more training material vs. a larger lexicon, and conclude that within our configuration, spending effort on lexicon construction yields higher returns.
منابع مشابه
A Resource for Natural Language Processing of Swiss German Dialects
Since there are only a few resources for Swiss German dialects, we compiled a corpus of 115,000 tokens, manually annotated with PoStags. The goal is to provide a basic data set for developing NLP applications for Swiss German. We extended the original corpus and improved its annotation consistency. Furthermore, we trained dialect-specific PoS-tagging models and implemented a baseline system for...
متن کاملAn improved joint model: POS tagging and dependency parsing
Dependency parsing is a way of syntactic parsing and a natural language that automatically analyzes the dependency structure of sentences, and the input for each sentence creates a dependency graph. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...
متن کاملMorphological features help POS tagging of unknown words across language varieties
Part-of-speech tagging, like any supervised statistical NLP task, is more difficult when test sets are very different from training sets, for example when tagging across genres or language varieties. We examined the problem of POS tagging of different varieties of Mandarin Chinese (PRC-Mainland, PRCHong Kong, and Taiwan). An analytic study first showed that unknown words were a major source of ...
متن کاملVenetan to English Machine Translation: Issues and Possible Solutions
In this paper we describe a prototype of a Venetan to English translation system developed under the Stilven project financed by the Regional Authorities of Veneto Region in Italy. The general approach is a statistical one with some preprocessing operations both at training and translation time (ortographic normalization and POS tagging to make use of factored models) which are needed especiall...
متن کاملArabic Dialect Processing Tutorial
The existence of dialects for any language constitutes a challenge for NLP in general since it adds another set of variation dimensions from a known standard. The problem is particularly interesting and challenging in Arabic and its different dialects, where the diversion from the standard could, in some linguistic views, warrant a classification as different languages. This problem would not b...
متن کامل